Statistical Learning Project

Introduction

Importing the dataset and loading libraries

library(imputeMissings)
library(rmdformats)
library(GGally)
library(circlize)
library(psych)
library(pscl)
library(dendextend)
library(sjPlot)
library(gridExtra)
library(fclust)
library(FactoMineR)
library(class)
library(cluster)
library(factoextra)
library(tidyverse)
library(ggpubr)
library(caret)
library(car)
library(lawstat)
library(lmtest)
library(MASS)
library(readxl)

Efige <- read_excel("/Users/saralattarulo/Desktop/Project/data_orig.xlsx", 
    sheet = "all",na = "9999999999")

VARIABLES

European Firms in a Global Economy (EFIGE) is a dataset containing information on a sample of 14758 firms, described by 489 qualitative and quantitative variables (obtained from the survey questions), in seven EU economies:

-Germany

-France

-Italy

-Spain

-United Kingdom

-Austria

-Hungary

Data was collected in 2010, covering the years from 2007 to 2009.

Since this dataset is very large, we will need to face several issues, such as dealing with NAs and outliers and establishing which variables are the most relevant for our aims.

I selected 28 variables, focusing on the first issue we could run into: the time dimension. Since the data were collected over a range of years, to avoid time-series problems all my variables refer to the year 2008. As a consequence, section C, about investments in R&D activities, has been excluded.

Section A - STRUCTURE OF THE FIRM:

COUNTRY: country of the firm (categorical) country

PAVITT: Classification of the firm (categorical) pavitt

A1: year of establishment of the firm (categorical) age

A2A: Turnover made by core product (%) core_p

A3: Annual turnover (categorical) turnov

A16: Foreign affiliates (abs) foreign_affiliate

A20: CEO is a family member (categorical) fam_ceo

A16_2: First shareholder: share of capital (%) first_share

SECTION B - WORKFORCE

B3: Total number of employees (abs) empl

b4_2_1: Entrepreneurs/executives not related to the family who owns the company (abs) no_family_n

b4_2_2: Entrepreneurs/executives related to the family who owns the company (abs) family_n

b4_2_3: White collars (abs) white_n

b4_2_4: Skilled blue collars (abs) skilled_n

b4_2_5: Unskilled blue collars (abs) unskilled_n

B5_2: Total number of employees involved in R&D activities (abs) involv_n

B6_2: Total number of university graduates in your workforce in your home country (abs) grad_n

B7_2: Total number of foreign employees in your workforce in your home country (abs) foreign_empl_n

B22: Employees that attended training programs (%) training

B23: Training type (categorical) training_type

Section D – INTERNATIONALIZATION

D6: Total exporter countries (abs) exp_richness

Section F - FINANCE

F0: External financing (categorical) external_financing

F9: Number of banks f9

F10: Share of the firm’s total bank debt held at its main bank (%) f10

F11: How many years has this bank been the firm’s main bank f11

Section E - MARKET & PRICING

E1: Firm’s turnover made up by sales of produced-to-order goods (%) po

Extra variables

Global Exporter: firm exporting to China, India or other Asian countries, or to the USA, Canada, Central or South America (categorical)

mkt_innov: Firms that carried out new to market innovation (categorical)

family_tmt: family people forming the business (%)

After loading the dataset, since we have variables of different kinds (even among the numerical ones), I will define some useful helper functions: standardization, normalization and one to remove outliers.

# Removing outliers
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Population standard deviation function
PopSD = function (x){
  sqrt(sum((x-mean(x))^2)/length(x))
}
# Variable standardizer function
Standardize = function(x){(x-mean(x))/PopSD(x)}

# Normalization
normalize <- function(x){
return ((x-min(x))/(max(x)-min(x)))
}
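As a quick sanity check on a toy vector (not the EFIGE data), the helpers behave as expected; with the 1.5×IQR rule, 100 is flagged as an outlier here:

x <- c(1, 2, 3, 4, 100)   # 100 is an obvious outlier
normalize(x)              # rescaled to [0, 1]
Standardize(x)            # centred and rescaled
remove_outliers(x)        # 100 is replaced by NA: 1 2 3 4 NA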

The variables taken into account are selected step by step, according to the analysis being implemented.

LET’S START

Multiple linear Regression

With multiple linear regression we can predict a numerical outcome using several predictors at once, rather than just one.

In our case, we are trying to understand whether the number of foreign employees in Italian and French firms depends on the following variables:

  • Total exporter countries;

  • Employees involved in research and development;

  • Percentage of the firm’s turnover related to produced-to-order goods;

  • Whether the firm carried out any market innovation.

To represent the causal path underlying the model, I use a DAG (Directed Acyclic Graph).

We extract the subset and, this time, simply drop NAs, since we lose only a few observations.

Regression<- Efige %>%
    dplyr::select(foreign_empl_n,exp_richness, no_family_n,involv_n,mkt_innov,po,country) %>%
    dplyr::mutate(country=as.factor(country),
          mkt_innov=as.factor(mkt_innov))%>%
    filter(country=="ITA"|country=="FRA")

Regression<-na.omit(Regression)
head(Regression)
## # A tibble: 6 x 7
##   foreign_empl_n exp_richness no_family_n involv_n mkt_innov    po country
##            <dbl>        <dbl>       <dbl>    <dbl> <fct>     <dbl> <fct>  
## 1              0           25       14          17 0           100 FRA    
## 2              0            5        1           1 0           100 FRA    
## 3              0            4        4.62        0 0           100 FRA    
## 4              3            3        3           2 0           100 FRA    
## 5              0            3        0           0 0           100 FRA    
## 6              1            3       25          40 0           100 FRA
psych::describe(Regression)
##                vars    n  mean    sd median trimmed  mad min  max range  skew
## foreign_empl_n    1 3048  3.12 13.61      1    1.23 1.48   0  500   500 21.90
## exp_richness      2 3048 10.91 14.43      6    7.96 5.93   1  200   199  4.16
## no_family_n       3 3048  6.05 23.30      1    2.17 1.48   0  700   700 14.64
## involv_n          4 3048  5.42 47.49      2    2.18 2.97   0 2500  2500 47.86
## mkt_innov*        5 3048  1.40  0.49      1    1.37 0.00   1    2     1  0.42
## po                6 3048 83.41 31.60    100   91.39 0.00   0  100   100 -1.82
## country*          7 3048  3.76  1.48      5    3.82 0.00   2    5     3 -0.35
##                kurtosis   se
## foreign_empl_n   672.43 0.25
## exp_richness      29.52 0.26
## no_family_n      326.87 0.42
## involv_n        2495.61 0.86
## mkt_innov*        -1.82 0.01
## po                 1.77 0.57
## country*          -1.88 0.03

Pre-processing the data: we standardize the predictors so that all variables enter the model on the same scale.

Regression$foreign_empl_n<-Standardize(Regression$foreign_empl_n)
Regression$exp_richness<-Standardize(Regression$exp_richness)
Regression$po<-Standardize(Regression$po)
Regression$involv_n<-Standardize(Regression$involv_n)

head(Regression)
## # A tibble: 6 x 7
##   foreign_empl_n exp_richness no_family_n involv_n mkt_innov     po country
##            <dbl>        <dbl>       <dbl>    <dbl> <fct>      <dbl> <fct>  
## 1      -0.0168         0.0677       14     0.00514 0         0.0166 FRA    
## 2      -0.0168        -0.0284        1    -0.00196 0         0.0166 FRA    
## 3      -0.0168        -0.0332        4.62 -0.00240 0         0.0166 FRA    
## 4      -0.000627      -0.0380        3    -0.00152 0         0.0166 FRA    
## 5      -0.0168        -0.0380        0    -0.00240 0         0.0166 FRA    
## 6      -0.0114        -0.0380       25     0.0153  0         0.0166 FRA

At first glance we notice that we don’t have linear relationships among the variables, so we will certainly need to apply some transformations.

Reg=Regression[,-5]
pairs(Reg[,-6])

Multiple linear Regression Model

After many trials, I settled on a log-lin model (with the exception of involv_n, which I also log-transformed). Note that taking the log of the previously standardized variables yields NaN for non-positive values, which is why so many observations are dropped as missing in the summary below.

MLR=lm(log(foreign_empl_n)~ exp_richness + po + log(involv_n) + mkt_innov, data = Regression) 
summary(MLR)
## 
## Call:
## lm(formula = log(foreign_empl_n) ~ exp_richness + po + log(involv_n) + 
##     mkt_innov, data = Regression)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2957 -0.9856 -0.0254  0.7803  3.4144 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.83547    0.43045  -4.264 3.72e-05 ***
## exp_richness   2.26741    0.88978   2.548  0.01193 *  
## po             0.41367    3.02275   0.137  0.89135    
## log(involv_n)  0.22960    0.07076   3.245  0.00148 ** 
## mkt_innov1    -0.39157    0.21598  -1.813  0.07202 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.251 on 137 degrees of freedom
##   (2906 observations deleted due to missingness)
## Multiple R-squared:  0.154,  Adjusted R-squared:  0.1293 
## F-statistic: 6.236 on 4 and 137 DF,  p-value: 0.0001221
par(mfrow=c(2,2))
plot(MLR)

Parameters interpretations

Other predictors constant, we can say:

2.26741 is the expected mean change in the log of foreign employees when the number of total exporter countries changes by one standard deviation.

In this case it has a positive effect and is significant since the p-value is less than 0.05;

0.41367 is the mean change in the log of foreign employees when the percentage of turnover made up by produced-to-order goods changes by one standard deviation.

In this case it has a positive effect, but it is not significant at all, since the p-value is far greater than 0.05. 0.22960 is the mean change in the log of foreign employees when the log of the number of people involved in R&D activities changes by one standard deviation.

In this case it has a positive effect and it’s very significant since the p-value is 0.00148;

-0.39157 is the average change in the log of foreign employees for firms that carried out new-to-market innovations. It has a negative effect, but it is not significant if we adopt a 0.05 significance level.
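Since this is a log-lin model, each slope can equivalently be read as an approximate percentage change in the unlogged outcome; a quick sketch, assuming MLR is the model fitted above:

# Approximate % change in the (unlogged) outcome per one-unit change
# in each predictor, from the log-lin specification
(exp(coef(MLR))-1)*100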

The adjusted R-squared may be low for many reasons; here it is not surprising, since we are working with a huge dataset and have omitted more than 100 variables that may well be relevant.

Let’s proceed with our assumptions check.

Assumptions

  • Linearity assumption
crPlots(MLR)

Every parameter seems to have an approximately linear relationship with the outcome

  • Mean of our error term is zero and normally distributed
mean(MLR$residuals) 
## [1] 2.053399e-17
qqnorm(MLR$residuals)

shapiro.test(MLR$residuals) 
## 
##  Shapiro-Wilk normality test
## 
## data:  MLR$residuals
## W = 0.9852, p-value = 0.1307

The mean of our residuals is approximately 0 and, according to the Shapiro-Wilk test, the residuals are normally distributed.

  • Homoscedasticity of residuals: the variance must be constant
lmtest::bptest(MLR) 
## 
##  studentized Breusch-Pagan test
## 
## data:  MLR
## BP = 2.3463, df = 4, p-value = 0.6724

Since the p-value is greater than 0.05, we cannot reject the null hypothesis that the variance of the residuals is constant.

  • Independence of errors
lmtest::dwtest(MLR)
## 
##  Durbin-Watson test
## 
## data:  MLR
## DW = 1.8656, p-value = 0.2114
## alternative hypothesis: true autocorrelation is greater than 0

The p-value here is higher than 0.05, so we cannot reject the null hypothesis of independent (non-autocorrelated) errors.

  • The x variables and residuals are uncorrelated

We should take the endogeneity problem into account in this case.

#H0: corr coefficient equal to 0
#length(Regression$exp_richness)
#length(MLR$residuals)

#cor(Regression$exp_richness, MLR$residuals)
#cor(Regression$no_family_n, MLR$residuals)
#cor(Regression$involv_n, MLR$residuals)
  • Absence of perfect multicollinearity
pairs(Regression)

car::vif(MLR) #VIF
##  exp_richness            po log(involv_n)     mkt_innov 
##      1.282009      1.201141      1.054852      1.027426

To get a first idea of the multicollinearity among the variables, we can plot them against one another. From this preliminary view, multicollinearity appears to be absent.

To be sure, we check the VIF (Variance Inflation Factor), which usually indicates a model free of multicollinearity concerns when it is below 2.

As we see, there is no multicollinearity, therefore the assumption is met.

Logistic Regression

With logistic regression, we can predict a categorical target using more than one regressor.

In our case, we are trying to understand whether the probability of being in the group of firms exporting to China, India or other Asian countries, or to the USA, Canada, Central or South America, depends on:

  • Country of the firm;

  • If the CEO is a family member or not;

  • If the firm carried out market innovations

  • People involved in R&D;

  • Number of graduate employees in the firm.

In this case, after extracting the sample, we impute the NAs with the median/mode method using the function impute(), in order not to lose information.
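To make the median/mode rule concrete, here is a toy illustration (not the EFIGE data): numeric columns receive their median, factors their mode.

toy <- data.frame(num = c(1, 2, NA, 4),
                  cat = factor(c("a", "a", "b", NA)))
vals <- imputeMissings::compute(toy, method = "median/mode")
imputeMissings::impute(toy, object = vals)
# num: NA -> 2 (median of 1, 2, 4); cat: NA -> "a" (the mode)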

Logistic<-Efige %>%
  dplyr::select(pavitt,age,country,mkt_innov,Global_exporter,fam_ceo,involv_n,grad_n)%>%
  filter(pavitt=="Specialized industries")%>%
 dplyr:: mutate(pavitt=as.factor(pavitt),
         age=as.factor(age),
         mkt_innov=as.factor(mkt_innov),
         country=as.factor(country),
         fam_ceo=as.factor(fam_ceo))


data<-imputeMissings::compute(Logistic, method = "median/mode")
Logistic<-imputeMissings::impute(Logistic, object = data)
Logistic <- na.omit(Logistic)

head(Logistic)
##                   pavitt                  age country mkt_innov Global_exporter
## 1 Specialized industries     More than 20 yrs     AUT         1               1
## 2 Specialized industries     More than 20 yrs     AUT         0               0
## 3 Specialized industries     More than 20 yrs     AUT         1               1
## 4 Specialized industries     More than 20 yrs     AUT         0               1
## 5 Specialized industries     More than 20 yrs     AUT         0               0
## 6 Specialized industries Between 20 and 6 yrs     AUT         1               0
##   fam_ceo involv_n grad_n
## 1       0    200.0   35.0
## 2       1      0.0    0.0
## 3       1     31.5   10.5
## 4       1      0.0    0.0
## 5       1      0.0    0.0
## 6       1      2.0    2.0

Logistic Model

After standardizing our numerical variables we are ready to build our model:

Logistic$involv_n=Standardize(Logistic$involv_n)
Logistic$grad_n=Standardize(Logistic$grad_n)

LR <- glm(Global_exporter ~ country + fam_ceo + mkt_innov + involv_n*grad_n, data=Logistic, family="binomial")
summary(LR)
## 
## Call:
## glm(formula = Global_exporter ~ country + fam_ceo + mkt_innov + 
##     involv_n * grad_n, family = "binomial", data = Logistic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3605  -0.9375  -0.7027   1.1936   2.3409  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.97704    0.28471  -3.432  0.00060 ***
## countryFRA        0.14154    0.29260   0.484  0.62857    
## countryGER        0.08751    0.28735   0.305  0.76072    
## countryHUN       -1.29254    0.42043  -3.074  0.00211 ** 
## countryITA        0.85191    0.28993   2.938  0.00330 ** 
## countrySPA        0.49326    0.29401   1.678  0.09341 .  
## countryUK         0.43084    0.29849   1.443  0.14891    
## fam_ceo1         -0.23380    0.08938  -2.616  0.00890 ** 
## mkt_innov1        0.81650    0.08671   9.416  < 2e-16 ***
## involv_n          5.09459    1.19240   4.273 1.93e-05 ***
## grad_n            8.83055    2.01994   4.372 1.23e-05 ***
## involv_n:grad_n -16.09100    3.50387  -4.592 4.38e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3546.5  on 2664  degrees of freedom
## Residual deviance: 3247.9  on 2653  degrees of freedom
## AIC: 3271.9
## 
## Number of Fisher Scoring iterations: 4
par(mfrow=c(2,2))
plot(LR)

Parameters interpretations

Taking AUT as the reference level, the comparison with HUN shows a statistically significant negative effect (-1.29254) on the log-odds of being a global exporter.

Taking AUT as the reference level, the comparison with ITA shows a statistically significant positive effect (0.85191) on the log-odds of being a global exporter.

Taking fam_ceo0 (the CEO is not a family member) as the reference level, fam_ceo1 (the CEO is a family member) shows a statistically significant negative effect (-0.23380) on the log-odds of being a global exporter.

Taking mkt_innov0 (no market innovation) as the reference level, mkt_innov1 (market innovation) shows a statistically significant positive effect (0.81650) on the log-odds of being a global exporter.

involv_n: 5.09459 is the average change in the logit when involv_n changes by one unit (one standard deviation, since the variable was standardized).

grad_n: 8.83055 is the average change in the logit when grad_n changes by one unit.

Moderation effect

involv_n:grad_n: -16.09100 is the average change in the logit when the product involv_n × grad_n changes by one unit.

It is interesting to see that, taken separately, the number of employees involved in R&D and the number of graduate employees have a positive impact, while when the number of graduates is treated as a moderator of the number of employees involved in R&D, we find a significant negative interaction effect.
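One way to see this moderation effect is to compare predicted probabilities on a small grid; a sketch, assuming LR is the fitted model above (the reference levels AUT, fam_ceo = 0 and mkt_innov = 1, and the values ±0.1 on the standardized scale, are arbitrary choices for illustration):

# Predicted probability of being a global exporter at low/high values
# of the two standardized predictors, holding the factors fixed
grid <- expand.grid(country   = factor("AUT", levels = levels(Logistic$country)),
                    fam_ceo   = factor("0",   levels = levels(Logistic$fam_ceo)),
                    mkt_innov = factor("1",   levels = levels(Logistic$mkt_innov)),
                    involv_n  = c(-0.1, 0.1),
                    grad_n    = c(-0.1, 0.1))
grid$p_hat <- predict(LR, newdata = grid, type = "response")
grid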

ODDS interpretation

For the odds interpretation, instead, we look at what happens to the odds when a predictor changes by one unit, i.e. the proportional rate at which the predicted odds change for each successive unit of x.

(exp(coef(LR))-1)*100
##     (Intercept)      countryFRA      countryGER      countryHUN      countryITA 
##      -62.357708       15.204523        9.145313      -72.542814      134.411796 
##      countrySPA       countryUK        fam_ceo1      mkt_innov1        involv_n 
##       63.764065       53.854495      -20.847942      126.257303    16213.650236 
##          grad_n involv_n:grad_n 
##   683908.140801      -99.999990

countryHUN: a 72.5% decrease in the odds of being in the group of exporting firms, relative to AUT.

countryITA: a 134.4% increase in the odds of being in the group of exporting firms, relative to AUT.

fam_ceo1: a 20.8% decrease in the odds of being in the group of exporting firms when the CEO is a family member.

mkt_innov1: a 126.2% increase in the odds of being in the group of exporting firms for firms that carried out market innovation.

involv_n: a 16213% increase in the odds of being in the group of exporting firms for a one-unit increase in (standardized) involv_n.

grad_n: a 683908% increase in the odds of being in the group of exporting firms for a one-unit increase in (standardized) grad_n.

involv_n:grad_n: a 99.99% decrease in the odds of being in the group of exporting firms for a one-unit increase in the interaction term.

Pseudo R-squared

Analogous to the R-squared, this index measures the share of variability explained by the model.

pR2(LR)
## fitting null model for pseudo-r2
##           llh       llhNull            G2      McFadden          r2ML 
## -1.623946e+03 -1.773262e+03  2.986320e+02  8.420412e-02  1.060067e-01 
##          r2CU 
##  1.440840e-01

My McFadden R-squared is approximately 0.1, a modest but not unreasonable share of explained variability for survey data of this kind.

Assumptions

  • Linearity assumption
probabilities <- predict((LR), type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")

#Linearity assumption only for numerical predictors:
numpred <- Logistic %>%
  dplyr::select(involv_n,grad_n) 

#Bind the logit and tidy the data for plotting
data <- numpred %>%
  mutate(logit = log(probabilities/(1-probabilities))) %>%
  gather(key = "numpred", value = "predictor.value", -logit)

data$logit=remove_outliers(data$logit)

Linearity<-ggplot(data, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~numpred, scales = "free_y")
Linearity

In both graphs there seems to be an approximately linear relationship, even if some outliers still pull the fitted line upwards.

  • No influential values
plot(LR, which = 4, id.n = 2)

From the graph we can see two highly influential points. Let’s remove them and fit the model again.

# Cooks distance & Leverage for every point
influence <- cooks.distance(LR) 
leverage <- hatvalues(LR)

# Remove points that were very influential in our first model
Logistic2 <- Logistic %>%
    mutate(cooks_distance = influence, leverage = leverage) %>%
    filter(cooks_distance < 0.01) %>%
    filter(leverage < 0.05)

Without influential points

LR2 <- glm(Global_exporter ~country +mkt_innov +fam_ceo+involv_n*grad_n, data=Logistic2, family="binomial")
summary(LR2)
## 
## Call:
## glm(formula = Global_exporter ~ country + mkt_innov + fam_ceo + 
##     involv_n * grad_n, family = "binomial", data = Logistic2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2004  -0.9251  -0.6882   1.1612   2.3898  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.95582    0.29115  -3.283  0.00103 ** 
## countryFRA        0.17225    0.29884   0.576  0.56434    
## countryGER        0.10820    0.29377   0.368  0.71263    
## countryHUN       -1.32656    0.43386  -3.058  0.00223 ** 
## countryITA        0.91648    0.29629   3.093  0.00198 ** 
## countrySPA        0.55661    0.30002   1.855  0.06356 .  
## countryUK         0.50517    0.30474   1.658  0.09737 .  
## mkt_innov1        0.77858    0.08778   8.870  < 2e-16 ***
## fam_ceo1         -0.20411    0.09031  -2.260  0.02382 *  
## involv_n         10.17644    1.68146   6.052 1.43e-09 ***
## grad_n           11.94331    2.17403   5.494 3.94e-08 ***
## involv_n:grad_n -65.93723   11.30911  -5.830 5.53e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3534.0  on 2654  degrees of freedom
## Residual deviance: 3196.3  on 2643  degrees of freedom
## AIC: 3220.3
## 
## Number of Fisher Scoring iterations: 4
par(mfrow=c(2,2))
plot(LR2)

#Pseudo R-squared
pR2(LR2)
## fitting null model for pseudo-r2
##           llh       llhNull            G2      McFadden          r2ML 
## -1.598138e+03 -1.767003e+03  3.377307e+02  9.556595e-02  1.194473e-01 
##          r2CU 
##  1.623351e-01

Now our model is free from influential points, and even the McFadden coefficient has increased.

  • Multicollinearity
car::vif(LR)
##                     GVIF Df GVIF^(1/(2*Df))
## country         1.123999  6        1.009789
## fam_ceo         1.072582  1        1.035655
## mkt_innov       1.041033  1        1.020310
## involv_n        1.678298  1        1.295491
## grad_n          2.951623  1        1.718029
## involv_n:grad_n 2.455627  1        1.567044
car::vif(LR2)
##                     GVIF Df GVIF^(1/(2*Df))
## country         1.138037  6        1.010834
## mkt_innov       1.046030  1        1.022756
## fam_ceo         1.071465  1        1.035116
## involv_n        3.121307  1        1.766722
## grad_n          1.827780  1        1.351954
## involv_n:grad_n 3.081163  1        1.755324

In both cases there is no multicollinearity.

  • Requires a large sample size
dim(Logistic)
## [1] 2665    8

Our sample contains 2665 observations, which is large enough, so the assumption is met.

Supervised Classification

In this case I set as labels the categorical variable describing the external financing of the firm. The aim is to predict whether an unseen firm is correctly classified into the group of firms that had external financing or not.

Let’s create a classification subset from our previous dataset:

Classif<-Efige %>%
  dplyr::select(external_financing,empl,no_family_n, family_n, skilled_n,
         unskilled_n,involv_n,grad_n,foreign_empl_n,core_p,po,f9,f10,f11,
         foreign_affiliate,white_n,exp_richness)%>%
  mutate(external_financing=as.factor(external_financing))

data1<-imputeMissings::compute(Classif, method = "median/mode")
Classif<-imputeMissings::impute(Classif, object = data1)

head(Classif)
##   external_financing empl no_family_n family_n skilled_n unskilled_n involv_n
## 1                  2  500          25        0        10           4      200
## 2                  1  500           0        0        10           4       10
## 3                  2  500          38        2        20         520       30
## 4                  2   30           4        2        15           4        1
## 5                  2   18           4        3         5           6        0
## 6                  2   25           4        2         8           5        1
##   grad_n foreign_empl_n core_p  po f9 f10 f11 foreign_affiliate white_n
## 1     35              0    100  10  2  60  12                 6       4
## 2     10              0    100  80  4  70  20                 2       4
## 3     30              0    100 100  2  60  12                 2      20
## 4      2              3    100  90  1  60  12                 2       5
## 5      1              4     95 100  3  60  12                 2       0
## 6      0              5     85   0  3  60  12                 2       6
##   exp_richness
## 1            0
## 2           50
## 3           40
## 4            6
## 5           10
## 6            3

Let’s see how well our KNN classification rule predicts the labels from our features (which in this case are almost all the variables).

K-Nearest Neighbours (KNN)

set.seed(2009)
random<- sample(1:14759) # Random permutation of the row indices

Classif<- Classif[random,] # Randomize Classif

Classif2<- as.data.frame(lapply(Classif[, 2:17], normalize))

head(Classif2)
##         empl  no_family_n family_n skilled_n  unskilled_n     involv_n
## 1 0.01632653 0.0011538462  0.00050    0.0013 0.0003201024 0.0006666667
## 2 0.01020408 0.0007692308  0.00050    0.0006 0.0003201024 0.0010000000
## 3 0.01020408 0.0000000000  0.00025    0.0008 0.0009603073 0.0000000000
## 4 0.03061224 0.0030769231  0.00000    0.0000 0.0000000000 0.0003333333
## 5 0.02244898 0.0000000000  0.00000    0.0010 0.0012804097 0.0000000000
## 6 0.02244898 0.0000000000  0.00025    0.0013 0.0006402049 0.0007000000
##         grad_n foreign_empl_n core_p  po         f9  f10        f11
## 1 2.000002e-06   0.000000e+00   0.50 1.0 0.01020408 0.01 0.02040816
## 2 1.200001e-05   3.000003e-06   0.90 0.8 0.00000000 0.01 0.04081633
## 3 1.000001e-06   5.000005e-06   0.90 1.0 0.00000000 1.00 0.08163265
## 4 1.400001e-05   0.000000e+00   0.95 0.4 0.01020408 0.60 0.11224490
## 5 0.000000e+00   0.000000e+00   1.00 0.8 0.02040816 0.60 0.11224490
## 6 2.100002e-07   0.000000e+00   0.70 0.5 0.00000000 0.60 0.11224490
##   foreign_affiliate      white_n exp_richness
## 1       0.006666667 0.0003333333  0.006006006
## 2       0.006666667 0.0006666667  0.001001001
## 3       0.006666667 0.0005000000  0.001001001
## 4       0.006666667 0.0028333333  0.010010010
## 5       0.006666667 0.0006666667  0.003003003
## 6       0.006666667 0.0008333333  0.006006006
Efige.train<-Classif2[1:11807,]
Efige.train.label<- Classif[1:11807,1]

Efige.test<- Classif2[11808:14759 ,] # Remaining ~20% as test set
Efige.test.label<- Classif[11808:14759 ,1]

predict<-knn(train=Efige.train, test=Efige.test, cl=Efige.train.label ,k=2)

Tab <- table(Efige.test.label, predict)
Tab # Confusion matrix
##                 predict
## Efige.test.label    1    2
##                1 1144  145
##                2  109 1555
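The headline rates can also be recovered directly from the confusion matrix; a quick check, taking class 1 as the positive row/column of Tab:

accuracy    <- sum(diag(Tab)) / sum(Tab)  # about 0.91
sensitivity <- Tab[1, 1] / sum(Tab[1, ])  # about 0.89
specificity <- Tab[2, 2] / sum(Tab[2, ])  # about 0.93
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)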

SPECIFICITY, SENSITIVITY AND AUC

psych::AUC(Tab, plot = "a")

## Decision Theory and Area under the Curve
## 
## The original data implied the following 2 x 2 table
##                 predict
## Efige.test.label Predicted.Pos Predicted.Neg
##         True.Pos         0.387         0.049
##         True.Neg         0.037         0.527
## 
## Conditional probabilities of 
##                 predict
## Efige.test.label Predicted.Pos Predicted.Neg
##         True.Pos         0.888         0.112
##         True.Neg         0.066         0.934
## 
## Accuracy =  0.91  Sensitivity =  0.89   Specificity =  0.93 
## with Area Under the Curve =  0.97
## d.prime =  2.72  Criterion =  1.51  Beta =  0.74
## Observed Phi correlation =  0.82 
## Inferred latent (tetrachoric) correlation  =  0.96

In our case, both Sensitivity and specificity are very high.

The accuracy is pretty high (0.91), meaning that our classifier predicts the labels very accurately.

The misclassification error rate is simply 1 - accuracy, i.e. 0.09, so only a few labels are mispredicted.

The AUC instead, indicates that our classifier has a very good (0.97) discriminating ability in recognizing firms that had external financing and firms that didn’t.

Plot accuracy and misclassification error rate against the number of neighbours

res=rep(NA,20)

for(i in 1:20){mod=knn(train=Efige.train, test=Efige.test, cl=Efige.train.label, k=i)
acc=sum(Efige.test.label==mod)
res[i]=1-acc/length(mod)}

par(mfrow=c(1,2))

plot(1:20,res,type="l",col="red", xlab="Neighbours", ylab="Misclassification error rate")
plot(1:20,1-res,type="l",col="blue", xlab="Neighbours", ylab="Accuracy")

As the plot shows, as we increase the number of neighbours the misclassification error rate rises and the accuracy falls correspondingly. The optimal number of neighbours seems to be around 5, since after that point both curves stabilize.

KNN with 10-fold cross-validation

This approach, instead, is based on resampling methods and is very useful for building the classifier without losing information from our dataset.

TrainData <- Classif[,2:17]
TrainClasses <- Classif[,1]

control <- trainControl(method="cv", number=10)

CrossValid <- train(TrainData, TrainClasses,
method = "knn",
preProcess = c("center", "scale"),
tuneLength = 12,
trControl = control)
## Registered S3 method overwritten by 'e1071':
##   method       from  
##   print.fclust fclust
plot(CrossValid)

CrossValid$finalModel
## 5-nearest neighbor model
## Training set outcome distribution:
## 
##    1    2 
## 6344 8415
CrossValid$results
##     k  Accuracy     Kappa  AccuracySD    KappaSD
## 1   5 0.9237736 0.8441902 0.009213652 0.01903260
## 2   7 0.9226217 0.8417639 0.008664146 0.01792613
## 3   9 0.9198434 0.8359510 0.009027340 0.01869233
## 4  11 0.9180819 0.8322377 0.009421195 0.01956279
## 5  13 0.9163877 0.8287219 0.009270391 0.01923180
## 6  15 0.9157103 0.8272645 0.008868868 0.01840080
## 7  17 0.9138806 0.8234187 0.008485834 0.01767168
## 8  19 0.9131353 0.8218497 0.008683706 0.01811487
## 9  21 0.9109671 0.8172811 0.009333064 0.01946255
## 10 23 0.9101538 0.8155804 0.008731181 0.01820062
## 11 25 0.9085954 0.8122963 0.008987108 0.01874480
## 12 27 0.9081212 0.8112831 0.008921858 0.01861707

As expected, the best number of neighbours is 5 (the smallest value tried), since the accuracy declines steadily as k grows.
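The tuned value of k and the cross-validated model can be pulled out of the caret object directly; a sketch, assuming CrossValid is the train() object above:

CrossValid$bestTune                              # k selected by 10-fold CV
head(predict(CrossValid, newdata = TrainData))   # reuses the same centring/scaling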

Clustering

Agglomerative Hierarchical Clustering

The features selected for the unsupervised classification will be:

  • Number of employees;

  • Skilled blue-collar employees;

  • Number of banks;

  • Percentage of trained employees.

My aim is to understand whether, according to the selected features, we can discover different patterns in our data.

Selecting and pre-processing the data (normalizing predictors and removing outliers).

Clustering<-Efige%>%
  filter(country=="HUN"|country=="AUT"|country=="FRA"|country=="GER")%>% # filter before selecting, otherwise `country` is no longer available
  dplyr::select(empl,skilled_n,f9,training) # I keep only four countries for readability of the plots, not because I want to find separation among countries!

data2<-imputeMissings::compute(Clustering, method = "median/mode")
Clustering<-imputeMissings::impute(Clustering, object = data2)

Clustering$training<-normalize(Clustering$training)
Clustering$f9<-normalize(Clustering$f9)
Clustering$empl<-normalize(Clustering$empl)
Clustering$skilled_n<-normalize(Clustering$skilled_n)

Clustering$training<-remove_outliers(Clustering$training)
Clustering$empl<-remove_outliers(Clustering$empl)
Clustering$skilled_n<-remove_outliers(Clustering$skilled_n)
Clustering$f9<-remove_outliers(Clustering$f9)

Clustering<-na.omit(Clustering)

head(Clustering)
##          empl skilled_n         f9 training
## 4  0.04081633    0.0015 0.00000000     0.01
## 5  0.01632653    0.0005 0.02040816     0.00
## 6  0.03061224    0.0008 0.02040816     0.13
## 7  0.01428571    0.0005 0.02040816     0.05
## 9  0.02448980    0.0005 0.01020408     0.15
## 11 0.01020408    0.0006 0.01020408     0.10

In this step we compute the distance matrix, plot the dendrogram, and use the Elbow and Silhouette methods to understand how many clusters we will obtain according to our features.

dd = dist(Clustering, method = "manhattan") # Distance matrix

cluster = hclust(dd, method = "average")    # Average linkage
plot(cluster)
rect.hclust(cluster, k = 2, border = 2:3)   # Highlight the two groups on the dendrogram

plot1 <- fviz_nbclust(Clustering, FUN = hcut, method = "wss",
k.max = 10) +
ggtitle("Elbow method")

plot2 <- fviz_nbclust(Clustering, FUN = hcut, method = "silhouette",
k.max = 10) +
ggtitle("Silhouette method")

Plot <- grid.arrange(plot1,plot2, ncol=2)

cut<-cutree(cluster, 2)
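The cut can also be profiled numerically (cluster sizes and per-cluster feature means). A self-contained sketch of the same pipeline, Manhattan distance with average linkage cut at k = 2, on the built-in USArrests data:

```r
# Same steps as above, on built-in data for reproducibility
d   <- dist(scale(USArrests), method = "manhattan")  # distance matrix
hc  <- hclust(d, method = "average")                 # average linkage
grp <- cutree(hc, k = 2)                             # cut into 2 clusters

table(grp)                                            # cluster sizes
aggregate(USArrests, by = list(cluster = grp), mean)  # per-cluster means
```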

Descriptive graph of the two clusters:

ggpairs(cbind(Clustering, cut=as.factor(cut)),
        columns=1:4, aes(colour=cut, alpha=0.5),
        lower=list(continuous="points"),
        upper=list(continuous="blank"),
        axisLabels="none", switch="both")+
        theme_bw()

From the dendrogram we can identify two different clusters according to our features. Since we cannot plot in a four-dimensional space, we must use a dimensionality-reduction plot: PCA. This representation is especially useful because it shows the maximum spread of the observations (when plotted on the first and second principal components).

We will plot our data on the factorial space spanned by the first two components and see what we obtain.

fviz_cluster(list(data=Clustering, cluster=cut))+
  theme_minimal()

pc <- PCA(Clustering, ncp = 2 ,scale.unit = TRUE)

At first glance we can tell that both groups of firms have a large within-cluster sum of squares and a small between-cluster sum of squares, meaning that the observations are not tightly packed around their cluster centres and the two clusters are not well separated.

The overlap of the two clusters indicates a group of firms that is very similar in terms of employees, blue-collar workers, number of banks and percentage of trained employees.

We can also plot a circular dendrogram for a better visualization of our clusters.

Clusters = hclust(dd, method = "average")
dend <- as.dendrogram(Clusters)
dend <- color_branches(dend, k=2)
circlize_dendrogram(dend, labels = FALSE)

Unfortunately, crisp clustering forces each observation to belong to one cluster or another, without considering that there could be a middle ground: fuzzy clustering.

Fuzzy clustering

Fuzzy clustering is another interesting approach to unsupervised classification, in which we take into account the degree of membership, so that an observation can belong to more than one group.

The choice of the number of groups is based on three indices that should be maximised:

  • The Modified Partition Coefficient index (MPC)

  • The Partition Coefficient index (PC)

  • The Fuzzy Silhouette index (FS)
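fclust provides `PC()` and `MPC()`, but both coefficients are simple functions of the membership matrix U (rows = observations, columns = clusters): PC is the mean of the squared memberships, and MPC rescales PC to [0, 1]. A small illustrative sketch (the function names below are hypothetical, not fclust's):

```r
# PC = (1/n) * sum_i sum_j u_ij^2 ;  MPC = (k*PC - 1) / (k - 1)
pc_index  <- function(U) mean(rowSums(U^2))
mpc_index <- function(U) {
  k <- ncol(U)
  (k * pc_index(U) - 1) / (k - 1)
}

# Toy membership matrix: one near-crisp row, one fuzzy, one maximally fuzzy
U <- matrix(c(0.9, 0.1,
              0.2, 0.8,
              0.5, 0.5), ncol = 2, byrow = TRUE)
pc_index(U)   # 1 for a crisp partition, 1/k for maximal fuzziness
mpc_index(U)
```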

# Clustering already contains only the four numeric features, so no column needs to be dropped

MPC = rep(NA, 10)
for (i in 2:length(MPC)) {
  clusf  = FKM(Clustering, i)   # fuzzy k-means with i groups
  MPC[i] = MPC(clusf$U)
}
MPC = MPC[-1]

PC = rep(NA, 10)
for (i in 2:length(PC)) {
  clusf = FKM(Clustering, i)
  PC[i] = PC(clusf$U)
}
PC = PC[-1]

FS = rep(NA, 10)
for (i in 2:length(FS)) {
  clusf = FKM(Clustering, i)
  FS[i] = SIL.F(Clustering, clusf$U)
}
## The default value alpha=1 has been set 
FS=FS[-1]

Data=data.frame(cbind(MPC,PC,FS))
head(Data)
##         MPC        PC        FS
## 1 0.7457851 0.8728925 0.8826363
## 2 0.7441771 0.8294514 0.8595656
## 3 0.7468647 0.8101485 0.8525431
## 4 0.7814114 0.8251291 0.8788448
## 5 0.7679537 0.8053658 0.8479812
## 6 0.7742184 0.8064730 0.8465916

Plot all three together to understand the optimal number of clusters:

x=c(2:10)
for (i in 1:ncol(Data)) {plot(x,Data[,i],type="l",lty=i,lwd=2,ylim=c(0,max(Data)+0.5),xlim=c(1,10),main="Number of Group Choice", xlab="Number of Groups",ylab="Index Value")
par(new=TRUE)}

legend("topright",c("MPC","PC","SIL.F"),lty=1:3,lwd=2)

MPC is maximised at 8 groups, while both SIL.F and PC are maximised at 2.

Since two of the three indices agree, the optimal choice is 2 clusters.

fanni = fanny(Clustering, 2, diss = FALSE, memb.exp = 2, metric = "euclidean",
              stand = FALSE, maxit = 500)

fviz_cluster(fanni,repel=TRUE)+
  theme_minimal()+
  scale_fill_brewer(palette = "Set1")

From the plot above we can see that firms in the intersection of the two clusters belong (with different degree of membership) to both clusters.

Note that by plotting the observations on the first two principal components we are explaining only about 70% of the variability.
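The explained variance does not have to be estimated from the plot: FactoMineR's PCA object stores the eigenvalues and the cumulative percentage of variance in its `eig` slot. A self-contained sketch on the built-in USArrests data (assuming FactoMineR is installed):

```r
library(FactoMineR)

pc_demo <- PCA(USArrests, ncp = 2, scale.unit = TRUE, graph = FALSE)
pc_demo$eig   # eigenvalue, % of variance, cumulative % per component

# Variance explained by the first two principal components:
pc_demo$eig[2, "cumulative percentage of variance"]
```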

Let’s now plot the degree-of-membership scatterplots to better visualize how strongly each observation belongs to each cluster:

pc <- PCA(Clustering, ncp = 2 ,scale.unit = TRUE, graph = FALSE)

coo=pc$ind$coord
dataf<-data.frame(coo[,1],coo[,2])
x1=coo[,1]
x2=coo[,2]
colnames(dataf)<-c("x1","x2")
Name=rownames(dataf)

Fuzzyplot1 <- ggplot(dataf, aes(x = x1, y = x2, colour = fanni$membership[, 1])) +
  geom_point(size = 1) +
  scale_colour_gradient("", low = "green", high = "red") +
  theme_minimal() +
  ggtitle("Fuzzy Clustering - Cluster 1")

Fuzzyplot2 <- ggplot(dataf, aes(x = x1, y = x2, colour = fanni$membership[, 2])) +
  geom_point(size = 1) +
  scale_colour_gradient("", low = "green", high = "red") +
  theme_minimal() +
  ggtitle("Fuzzy Clustering - Cluster 2")

Fuzzyplot1

Fuzzyplot2

The observations shaded in intermediate (orange) colours are firms whose membership degree is close to 0.5, i.e. they belong to both clusters to a similar extent.